120 research outputs found

    Finding an unknown number of multivariate outliers

    We use the forward search to provide robust Mahalanobis distances to detect the presence of outliers in a sample of multivariate normal data. Theoretical results on order statistics and on estimation in truncated samples provide the distribution of our test statistic. We also introduce several new robust distances with associated distributional results. Comparisons of our procedure with tests using other robust Mahalanobis distances show the good size and high power of our procedure. We also provide a unification of results on correction factors for estimation from truncated samples
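
As a rough illustration of the distance being discussed, the sketch below computes squared Mahalanobis distances for bivariate data, using a mean and covariance estimated from a chosen subset, and flags points beyond a chi-square cutoff. It is not the forward search procedure of the paper: the function names, the 2-D restriction, and the plain chi-square reference (instead of the order-statistic and truncation corrections the abstract mentions) are all simplifying assumptions.

```python
import math

def mean_and_cov(points):
    """Sample mean and unbiased covariance matrix of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    (a, b), (_, d) = cov
    det = a * d - b * b
    return (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det

def chi2_quantile_2df(q):
    """Quantile of the chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - q)

def flag_outliers(points, subset, level=0.99):
    """Indices of points whose distance from the subset fit exceeds the cutoff."""
    mean, cov = mean_and_cov(subset)
    cutoff = chi2_quantile_2df(level)
    return [i for i, p in enumerate(points)
            if mahalanobis_sq(p, mean, cov) > cutoff]

clean = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
data = clean + [(10, 10)]
flagged = flag_outliers(data, clean)
```

Estimating the centre and scatter from a subset believed to be clean is what keeps the outlier from inflating the covariance and masking itself; the paper's contribution is the correct reference distribution for such subset-based distances, which this sketch does not attempt.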

    On characterizations and tests of Benford’s law

    Benford's law defines a probability distribution for patterns of significant digits in real numbers. When the law is expected to hold for genuine observations, deviation from it can be taken as evidence of possible data manipulation. We derive results on a transform of the significand function that provide motivation for new tests of conformance to Benford's law exploiting its sum-invariance characterization. We also study the connection between sum invariance of the first digit and the corresponding marginal probability distribution. We approximate the exact distribution of the new test statistics through a computationally efficient Monte Carlo algorithm. We investigate the power of our tests under different alternatives and we point out relevant situations in which they are clearly preferable to the available procedures. Finally, we show the application potential of our approach in the context of fraud detection in international trade
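
For reference, Benford's law assigns probability log10(1 + 1/d) to first significant digit d. The sketch below computes these probabilities and a classical Pearson chi-square conformance statistic. This is the standard benchmark test, not the sum-invariance tests proposed in the paper; the function names are illustrative.

```python
import math
from collections import Counter

def benford_first_digit_probs():
    """Benford probabilities for first significant digits 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Leading significant digit of a nonzero number."""
    # Scientific notation puts the leading digit first, e.g. '3.14...e+02'.
    return int(f"{abs(x):.15e}"[0])

def chi_square_statistic(values):
    """Pearson chi-square distance between observed first digits and Benford."""
    probs = benford_first_digit_probs()
    counts = Counter(first_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in probs.items())

# Powers of 2 are a classic sequence that conforms closely to Benford's law;
# a sample whose values all start with 9 does not.
stat_conforming = chi_square_statistic([2.0 ** k for k in range(1000)])
stat_suspicious = chi_square_statistic([9.0 + 0.001 * k for k in range(100)])
```

A large statistic relative to the chi-square distribution with 8 degrees of freedom suggests departure from the law, which in the fraud-detection setting is treated as a signal for further inspection rather than as proof of manipulation.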

    The Forward Search for Very Large Datasets

    The identification of atypical observations and the immunization of data analysis against both outliers and failures of modeling are important aspects of modern statistics. The forward search is a graphics-rich approach that leads to the formal detection of outliers and to the detection of model inadequacy combined with suggestions for model enhancement. The key idea is to monitor quantities of interest, such as parameter estimates and test statistics, as the model is fitted to data subsets of increasing size. In this paper we propose some computational improvements of the forward search algorithm and we provide a recursive implementation of the procedure which exploits the information of the previous step. The output is a set of efficient routines for fast updating of the model parameter estimates, which do not require any data sorting, and fast computation of likelihood contributions, which do not require matrix inversion or QR decomposition. It is shown that the new algorithms enable a reduction of the computation time by more than 80%. Furthermore, the running time now increases almost linearly with the sample size. All the routines described in this paper are included in the FSDA toolbox for MATLAB, which is freely downloadable from the internet.
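
The idea of updating estimates recursively as the fitted subset grows can be illustrated with a much simpler case than the regression routines of the paper. The sketch below, a standard Welford-style update and not the FSDA algorithm, monitors the mean and variance of a univariate sequence as one observation is added at a time, reusing the previous step's quantities instead of recomputing from scratch. It assumes the observations arrive already in the order in which they enter the subset.

```python
def running_mean_var(values):
    """Yield (subset size, mean, unbiased variance) as observations are added."""
    mean, m2 = 0.0, 0.0
    for m, x in enumerate(values, start=1):
        delta = x - mean
        mean += delta / m            # update the mean from the previous step
        m2 += delta * (x - mean)     # update the sum of squared deviations
        var = m2 / (m - 1) if m > 1 else 0.0
        yield m, mean, var

# The last observation is atypical: the monitored variance jumps when it enters.
trajectory = list(running_mean_var([1.0, 2.0, 3.0, 4.0, 100.0]))
```

Each step costs constant time, so monitoring the whole trajectory is linear in the number of observations, which is the kind of scaling the abstract reports for the full procedure.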

    Tempered positive Linnik processes and their representations

    This paper analyzes various classes of processes associated with the tempered positive Linnik (TPL) distribution. We provide several subordinated representations of TPL Lévy processes and in particular establish a stochastic self-similarity property with respect to negative binomial subordination. In finite activity regimes we show that the explicit compound Poisson representations give rise to innovations following Mittag-Leffler type laws which are apparently new. We characterize two time-inhomogeneous TPL processes, namely the Ornstein-Uhlenbeck (OU) Lévy-driven processes with stationary distribution and the additive process determined by a TPL law. We finally illustrate how the properties studied come together in a multivariate TPL Lévy framework based on a novel negative binomial mixing methodology. Some potential applications are outlined in the contexts of statistical anti-fraud and financial modelling

    Simulating mixtures of multivariate data with fixed cluster overlap in FSDA library

    We extend the capabilities of MixSim, a framework which is useful for evaluating the performance of clustering algorithms, on the basis of measures of agreement between data partitioning and flexible generation methods for data, outliers and noise. The peculiarity of the method is that data are simulated from normal mixture distributions on the basis of pre-specified synthesis statistics on an overlap measure, defined as a sum of pairwise misclassification probabilities. We provide new tools which enable us to control additional overlapping statistics and departures from homogeneity and sphericity among groups, together with new outlier contamination schemes. The output of this extension is a more flexible framework for generation of data to better address modern robust clustering scenarios in the presence of possible contamination. We also study the properties and the implications that this new way of simulating clustering data entails in terms of coverage of space, goodness of fit to theoretical distributions, and degree of convergence to nominal values. We demonstrate the new features using our MATLAB implementation that we have integrated in the Flexible Statistics for Data Analysis (FSDA) toolbox for MATLAB. With MixSim, FSDA now integrates in the same environment state-of-the-art robust clustering algorithms and principled routines for their evaluation and calibration. A spin-off of our work is a general complex routine, translated from C language to MATLAB, to compute the distribution function of a linear combination of non-central χ² random variables which is at the core of MixSim and has its own interest for many test statistics
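
The overlap measure, a sum of the two pairwise misclassification probabilities, has a closed form in the simplest case. The sketch below assumes two univariate normal components with equal mixing proportions and a common standard deviation, so the classification boundary is the midpoint between the means. The general multivariate case handled by MixSim requires the distribution of a linear combination of non-central χ² variables and is not attempted here; the function names are illustrative.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pairwise_overlap(mu1, mu2, sigma):
    """Overlap of two equal-weight normal components with common sd sigma.

    Each misclassification probability is the chance that a draw from one
    component falls on the other component's side of the midpoint.
    """
    half_gap = abs(mu1 - mu2) / (2.0 * sigma)
    misclassification = normal_cdf(-half_gap)
    return 2.0 * misclassification

overlap_close = pairwise_overlap(0.0, 2.0, 1.0)   # means two sd apart
overlap_far = pairwise_overlap(0.0, 6.0, 1.0)     # means six sd apart
```

Fixing a target overlap and solving for the separation between means is the inverse problem that a simulation framework of this kind has to address when generating clusters with a pre-specified degree of overlap.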

    Finding the Number of Groups in Model-Based Clustering via Constrained Likelihoods

    Deciding the number of clusters k is one of the most difficult problems in Cluster Analysis. For this purpose, complexity-penalized likelihood approaches have been introduced in model-based clustering, such as the well-known BIC and ICL criteria. However, the classification/mixture likelihoods considered in these approaches are unbounded without any constraint on the cluster scatter matrices. Constraints also prevent traditional EM and CEM algorithms from being trapped in (spurious) local maxima. Controlling the maximal ratio between the eigenvalues of the scatter matrices to be smaller than a fixed constant c ≥ 1 is a sensible idea for setting such constraints. A new penalized likelihood criterion which takes into account the higher model complexity that a higher value of c entails is proposed. Based on this criterion, a novel and fully automated procedure, leading to a small ranked list of optimal (k; c) couples is provided. Its performance is assessed both in empirical examples and through a simulation study as a function of cluster overlap
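
The eigenvalue-ratio constraint can be checked directly. The sketch below reads the constraint as the ratio of the largest to the smallest eigenvalue taken over all cluster scatter matrices, which is an interpretation of the abstract and not a quotation of the paper's definition. It is restricted to 2x2 symmetric matrices so that eigenvalues have a closed form, and the function names are illustrative.

```python
import math

def eigenvalues_sym2(m):
    """Eigenvalues (smallest, largest) of a symmetric 2x2 matrix."""
    (a, b), (_, d) = m
    mid = (a + d) / 2.0
    rad = math.hypot((a - d) / 2.0, b)
    return mid - rad, mid + rad

def eigenvalue_ratio(scatters):
    """Largest eigenvalue over smallest eigenvalue, pooled across clusters."""
    eigs = [ev for s in scatters for ev in eigenvalues_sym2(s)]
    return max(eigs) / min(eigs)

def satisfies_constraint(scatters, c):
    """True when the pooled eigenvalue ratio does not exceed c (c >= 1)."""
    return eigenvalue_ratio(scatters) <= c

scatters = [((1.0, 0.0), (0.0, 1.0)),   # spherical cluster
            ((4.0, 0.0), (0.0, 2.0))]   # elongated, larger cluster
ratio = eigenvalue_ratio(scatters)
```

With c = 1 all clusters are forced to be spherical with equal scatter, while larger c allows increasingly different shapes and sizes, which is why a higher c corresponds to higher model complexity in the penalized criterion.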

    A new family of tempered distributions

    Tempered distributions have received considerable attention, both from a theoretical point of view and in several important application fields. The most popular choice is perhaps the Tweedie model, which is obtained by tempering the Positive Stable distribution. Through tempering, we suggest a very flexible four-parameter family of distributions that generalizes the Tweedie model and that could be applied to data sets of non-negative observations with complex (and difficult to accommodate) features. We derive the main theoretical properties of our proposal, through which we show its wide application potential. We also embed our proposal within the theory of Lévy processes, thus providing a strengthened probabilistic motivation for its introduction. Furthermore, we derive a series expansion for the probability density function which allows us to develop algorithms for fitting the distribution to data. We finally provide applications to challenging real-world examples taken from international trade

    Newcomb–Benford law and the detection of frauds in international trade

    Countering fraud in international trade is a crucial task of modern economic regulations. We develop statistical tools for the detection of fraud in customs declarations that rely on the Newcomb–Benford law for significant digits. Our first contribution is to show the features, in the context of a European Union market, of the traders for which the law should hold in the absence of fraudulent data manipulation. Our results shed light on a relevant and debated question, since no general known theory can exactly predict validity of the law for genuine empirical data. We also provide approximations to the distribution of test statistics when the Newcomb–Benford law does not hold. These approximations open the door to the development of modified goodness-of-fit procedures with wide applicability and good inferential properties